Poster: CoJaBo Date: Aug 7, 2012 4:32pm
Forum: web Subject: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Domainsponsor.com is indiscriminately removing the archived copies of the 135,000+ domains they provide parking services for. As Domainsponsor buys expired domains for resale, and is probably one of the larger companies doing this, a substantial portion of all domain names will be removed from the Internet Archive when they expire. This substantially degrades the usefulness of projects such as Wikipedia, which rely on Archive.org to preserve web sources.

This appears deliberate [EDIT: others have pointed out that there are sites listing these lines as "bad robots", separated from the explanation on the removal FAQ; it is possible they got them from a source like that and were unaware of what they actually do], as the /robots.txt on every DS-hosted parked domain contains the entries that, per the removal FAQ page (http://archive.org/about/exclude.php), remove all prior copies of that domain from the archive:
User-agent: ia_archiver
Disallow: /

A list of the 135,000+ domains is available here: http://nslist.net/ns1.dsredirection.com/1
It is possible they use nameservers other than this one, so this might not be the entire list of affected domains. Attempting to access any of these domains through the archive results in the error "Page cannot be crawled or displayed due to robots.txt."
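
For anyone who wants to spot-check a domain from that list, here's a minimal sketch (Python 3 standard library only; the domain name below is a placeholder, not an entry from the list):

import urllib.robotparser

def ia_archiver_blocked(domain):
    # Fetch and parse the domain's robots.txt
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://%s/robots.txt" % domain)
    rp.read()
    # can_fetch() returns False when the "User-agent: ia_archiver" /
    # "Disallow: /" pair from the removal FAQ is present
    return not rp.can_fetch("ia_archiver", "http://%s/" % domain)

print(ia_archiver_blocked("example-parked-domain.com"))  # placeholder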

An email I sent to their support on June 9th has thus far gone unanswered.

This post was modified by CoJaBo on 2012-08-07 23:32:10

Poster: jory2 Date: Aug 8, 2012 9:13am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

@CoJaBo:
you wrote - "As Domainsponsor buys expired domains for resale"
I would like to point out that the URLs bought by Domainsponsor do not come furnished with the copyright-protected intellectual property once used on them; Domainsponsor bought only the domain name, NOT the content.
NetflixDOTcom: do you really think that if the company gave up the URL and it was bought by Domainsponsor, all the content used on netflix would be owned by Domainsponsor? Or that because the Internet Archive made copies of the content directly from netflix's website, the Internet Archive can somehow claim ownership?

It's pointless emailing Domainsponsor or the admins of this website for answers regarding materials rightfully owned by other people.

RE: The Internet Archive's policy and Robots.txt, who cares really? People write crazy things into company policies on a daily basis.

This website, however, does not delete materials simply because a webmaster placed a robots.txt file on their site; this website only blocks or excludes the content from 'public view'.

Poster: CoJaBo Date: Aug 21, 2012 1:19pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

I still do not understand how, exactly, this is relevant. The issue I've brought up is about a company that, knowingly or not, is removing tens of millions of potentially useful pages from the Internet Archive. This has nothing to do with intellectual property; if you want to discuss IP issues, please start a new thread for it.

Poster: jory2 Date: Aug 21, 2012 1:39pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

What's relevant is the content, CoJaBo. This website (the Internet Archive) should have asked for permission in the first place, before making copies of websites owned and operated by other people. Whether the pages collected are/were "potentially useful" is irrelevant.

Poster: PDpolice Date: Aug 26, 2012 6:05am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Your argument suggests all libraries should be closed.
As a researcher, I would prefer that copies of the domains be maintained, out of reach of the history revisers.

Poster: Euph0ria Date: Aug 26, 2012 3:15am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

This is very unsettling news, as I believe it sets a dire precedent. Does this mean that, as of now, because Domainsponsor sets a rule on a newly acquired domain, the previous owner's content will be removed from the archive?
And is this, to date, still going to cause those 135k domains to be removed from the Internet Archive?

Worst-case scenario: is there a way to discover the list of sites flagged in this way, and to independently archive the Internet Archive?
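
Something like this is what I'm imagining; a rough sketch only, assuming the Wayback Machine's public CDX endpoint at web.archive.org/cdx/search/cdx, and setting aside whether robots.txt-blocked snapshots can still be fetched at all:

import urllib.request

def list_captures(domain):
    # Ask the CDX API for one line per unique capture of the domain's root page
    url = ("http://web.archive.org/cdx/search/cdx"
           "?url=%s&fl=timestamp,original&collapse=digest" % domain)
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode().splitlines()

for line in list_captures("example.com"):  # placeholder domain
    timestamp, original = line.split(" ", 1)
    # Each capture could then be saved independently from its wayback URL:
    print("http://web.archive.org/web/%s/%s" % (timestamp, original))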

Poster: Jeremy Leader Date: Jun 13, 2012 9:55am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

This strikes me as a conflict between the idea of domain names changing hands (being bought & sold, expiring and later being registered by someone else, etc.), and the idea that control of robots.txt on a given domain grants control over the entire history of that domain, stretching back to when it was under other ownership.

Is there any way to indicate, in robots.txt, or elsewhere, "don't crawl this site any more, but leave the old history intact"?

Poster: hwhankins Date: Jul 13, 2012 3:53pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

I've run into this problem a lot. As an amateur radio operator, I look up things on the web, and there's a lot of good info on small amateur radio club sites and individuals' sites whose domains have expired. Sometimes I can get the info through the archive, but lately, more and more of the time, I'm running into the problem of the new owner of the domain (usually a domain parking service) using a robots.txt to delete the archive copy of the prior site.

This archive.org policy of allowing a new owner of a domain to destroy the archive of the previous owner's site is the same as allowing them to burn down a public library. The new owner of a domain has no legal right to control the previous owner's content.

Back-deleting the archives should require a notarized statement, under penalty of perjury, that the requester owns the content they're asking to have deleted.

Poster: Jeremy Leader Date: Jul 13, 2012 11:48pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

But whose responsibility is this problem? The (new) domain owners aren't trying to erase the archives; they're (presumably) trying to keep their current site from being crawled, which as I understand it is the purpose of the robots.txt standard. I've looked at all the documentation linked from robotstxt.org, and none of it mentions deleting archives based on robots.txt entries.

I can imagine that archive.org needs policies and procedures for owners of content to request its removal; perhaps they could extend robots.txt to add a "Delete:" directive (with syntax otherwise identical to "Disallow:")? So "Disallow:" would mean "don't crawl", and "Delete:" would mean "don't crawl, and forget you ever saw it".
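
A site that wanted its history erased would then write something like this (hypothetical syntax, to be clear; this is my proposal, not anything archive.org actually supports today):

User-agent: ia_archiver
Delete: /

while a site that merely wanted to stop being crawled would stick with "Disallow: /", and its old history would survive.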

For most search engines, there wouldn't be much difference (because they frequently re-crawl, and only serve search results from the most recent crawl), but for archive.org or other similar archives, the difference would be huge.

Now, it's entirely possible that many of these domain owners really do want the history of their domains deleted; then the problem is distinguishing between the current owner of a domain deleting history of content they provided, and the current owner of a domain trying to delete history of content a previous owner provided.

Note that the change of ownership of a domain doesn't automatically include change of ownership of copyright in the content previously served from the domain. But to fully respect that principle, archive.org would have to track domain ownership (which can be tricky, between whois privacy, and things like shell companies and subsidiaries blurring the concept of "same owner").

Poster: CoJaBo Date: Jul 25, 2012 6:56am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Just thought I'd bump/clarify this: Archive.org does have a special "delete" directive, as pointed out in the first post. A site that wishes to erase its content has to copy those specific lines from the removal FAQ page to erase already-archived content; it doesn't happen just by disallowing all robots. The issue here is that DomainSponsor is adding this delete directive, unilaterally, to hundreds of thousands of domains it buys; since DS is one of the largest expired-domain resellers, this probably accounts for a significant percentage of ALL domains that have expired or ever will.

Poster: jory2 Date: Jul 25, 2012 10:01am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

@CoJaBo - where is this special "delete" directive you're referring to? From what I read, this website's policies describe no such "delete" command.

Words like Block / Exclude / Disallow appear throughout this company's official policies, but I have never come across the word *DELETE*; have you?
In fact, what their policy appears to say is that a simple block from free public view is enacted.

This post was modified by jory2 on 2012-07-25 17:01:52

Poster: Jeremy Leader Date: Jul 25, 2012 9:49am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

What "delete directive"?

Are you referring to:
User-agent: ia_archiver
Disallow: /

My understanding of the robots.txt standard is that this says "ia_archiver, don't crawl any part of this site." The robots.txt standard is completely silent on the subject of archives (because the standard was designed for search-engine crawlers, which only publish the results of their most recent crawl).

How should I specify "don't crawl my site (so new content doesn't get added to the archives), but don't delete any old archives"? My understanding is DS is using the robots.txt protocol to indicate their desire that their domains' current content not be crawled or archived; I doubt they care about archives of their domains' previous history (before they bought them).

Is there a different User-agent string that controls just the crawl, and not the persistence of old archives?

Archive.org is choosing to allow the current owner of a domain to control the archives of the past history of that domain, even though the current owner likely doesn't have any legal ownership of the previous content.

Poster: jory2 Date: Jul 25, 2012 10:43am
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

@Jeremy Leader - I wasn't referring to any Disallow; I asked CoJaBo where he/she found the delete command.
Oddly enough, I have never come across one for this website.

RE: robots.txt - as I'm sure you know, respecting the robots.txt file is voluntary; any informed site owner would not rely solely on such a file if their goal was to protect their intellectual property from being copied without permission or consent.

For what it's worth, the domain name _ _ _.com doesn't own the copyrights to the copyright-protected materials that were used on that domain name; the rightful owner of the works remains the owner.
Should the rightful owner wish not to renew or reactivate his/her website, that certainly doesn't mean the copyrights expired along with the domain name.
When the domain name is "parked", the copyrights and all intellectual property rights remain the property of the rightful owner. The rights are not automatically transferred to the new owner of the domain name, or to companies like Domainsponsor.com.
You wrote: "I doubt they care about archives of their domains' previous history (before they bought them)."
What exactly do you think they bought?
Wouldn't common sense tell you, and anyone else, that all that was bought was the URL _ _ _.com, and not all the rights to the works that were used on the URL previously?

This post was modified by jory2 on 2012-07-25 17:43:56

Poster: Jeremy Leader Date: Jul 25, 2012 1:50pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+ domains

Jory2, I believe I'm in complete agreement with you. My reference to a "delete directive" was in response to CoJaBo.

Can anyone confirm CoJaBo's claim that a directive like:

User-agent: *
Disallow: /

will NOT cause archive deletion (but will cause the Internet Archive to stop crawling the site)?

Poster: CoJaBo Date: Aug 7, 2012 4:03pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

Disallowing the user-agent "*" will not cause removal; it will stop crawling as expected, but only the specific user-agent given on the removal FAQ page will actually cause removal of past content.
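
To spell out the difference as I understand it (going only by the removal FAQ page; the Archive doesn't document its internal handling beyond that):

# Stops future crawling by everyone; existing snapshots stay visible:
User-agent: *
Disallow: /

# Per the removal FAQ, this one also removes already-archived copies:
User-agent: ia_archiver
Disallow: /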

I do agree that the directive should have to state "Remove" explicitly, to prevent such a mistake; others off-site have pointed out that some "bad bot" blacklists also include these lines without explaining what "ia_archiver" actually is. It's possible DomainSponsor got the lines from a blacklist like that and didn't realize they would cause *removal* of content from the Archive, since at that point they would have been far separated from the removal FAQ's explanation.

If anyone's been following the list of sites registered to their nameserver (that is, sites being removed from the Archive in this way), it's increased nearly two-hundred-fold since I made this post; the current count is over 24 *million*.
I'm not sure whether this indicates they are expanding that rapidly or simply that that particular index site is catching up with their existing registrations; I suspect the latter is more likely.

Poster: Jeremy Leader Date: Aug 7, 2012 4:37pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

OK, CoJaBo, thanks for that clarification.

So there's no way to say "Internet Archive, don't crawl my site, but don't delete the archive", while still allowing other crawlers to crawl the site?

Poster: CoJaBo Date: Aug 7, 2012 4:40pm
Forum: web Subject: Re: Domainsponsor.com erasing prior archived copies of 135,000+^W 24 million+ domains

It doesn't seem so; the FAQ only mentions those lines, for removal. It doesn't seem to give an option for "don't crawl the site anymore, but still keep the existing content".

I had hoped someone from either DomainSponsor or the Archive would have responded to my emails by now.